Guiding Quality Assurance Through Context Aware Learning
Software Testing is a quality control activity that, in addition to finding flaws or bugs, provides confidence in the software’s correctness. The quality of the developed software depends on the strength of its test suite. Mutation Testing has been shown to effectively guide the improvement of a test suite’s strength. Mutation is a test adequacy criterion in which test requirements are represented by mutants. Mutants are slight syntactic modifications of the original program that aim to introduce semantic deviations (from the original program), requiring testers to design tests that kill these mutants, i.e., that distinguish the observable behavior of a mutant from that of the original program. This process of designing tests to kill a mutant is performed iteratively for the entire mutant set, which augments the test suite and hence improves its strength.
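To make the kill relation concrete, consider a tiny, hypothetical example (invented for this sketch, not drawn from the dissertation’s subject programs) of an original program, a mutant, and an input that kills it:

```python
# Hypothetical illustration of mutation testing; program and mutant
# are invented for this sketch, not taken from the dissertation.

def max_of(a, b):
    """Original program under test."""
    return a if a > b else b

def max_of_mutant(a, b):
    """Mutant: the relational operator '>' is mutated to '<'."""
    return a if a < b else b

# A test input that kills the mutant: the mutant's observable
# behavior deviates from the original program on this input.
assert max_of(2, 1) == 2         # original behaves correctly
assert max_of_mutant(2, 1) == 1  # mutant misbehaves, so a test
                                 # asserting max_of(2, 1) == 2 kills it
```

A test suite containing `assert max_of(2, 1) == 2` passes on the original but fails on the mutant, which is exactly what “killing” the mutant means.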
Although mutation testing is empirically validated, a key issue is that its application is expensive due to the large number of low-utility mutants it introduces. Some mutants cannot even be killed, as they are functionally equivalent to the original program. To reduce the application cost, it is imperative to limit the number of mutants to those that are actually useful. Since identifying such mutants requires manual analysis and test executions, no effective solution to the problem exists, and it remains unclear how to mutate and test code efficiently.
On the other hand, with the advancement of deep learning, several recent works in the literature have focused on applying it to source code to automate many nontrivial tasks, including bug fixing, producing code comments, code completion, and program repair. The increasing utilization of deep learning is due to a combination of factors. The first is the vast availability of data to learn from, specifically source code in open-source repositories. The second is the availability of inexpensive hardware able to efficiently run deep learning infrastructures. The third, and most compelling, is its ability to automatically learn how to categorize data from the code context through its hidden-layer architecture, making it especially proficient at identifying features. Thus, we explore the possibility of employing deep learning to identify only useful mutants, in order to achieve a good trade-off between the invested effort and test effectiveness.
Hence, as our first contribution, this dissertation proposes Cerebro, a deep learning approach to statically select subsuming mutants based on the mutants’ surrounding code context. As subsuming mutants reside at the top of the subsumption hierarchy, test cases designed to kill only this minimal subset of mutants kill all the remaining mutants. Our evaluation of Cerebro demonstrates that it preserves the mutation testing benefits while limiting the application cost, i.e., reducing all cost factors such as equivalent mutants, mutant executions, and the mutants requiring analysis.
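For intuition, subsuming mutants can be derived dynamically from a kill matrix as sketched below (an illustrative computation with invented data; Cerebro’s contribution is precisely to predict such mutants statically, without building this matrix):

```python
# Illustrative sketch: deriving subsuming mutants from a kill matrix.
# Mutant A subsumes mutant B when every test that kills A also kills B,
# i.e., A's killing-test set is a subset of B's.

def subsuming(kill_matrix):
    """kill_matrix maps mutant -> frozenset of tests that kill it.
    Returns the mutants at the top of the subsumption hierarchy:
    killable mutants not strictly subsumed by any other mutant."""
    killed = {m: t for m, t in kill_matrix.items() if t}  # drop equivalent mutants
    tops = []
    for m, tm in killed.items():
        # '<' on frozensets is strict-subset: some other mutant is
        # strictly harder to kill, so m is not subsuming.
        if not any(tn < tm for n, tn in killed.items() if n != m):
            tops.append(m)
    return tops

matrix = {
    "m1": frozenset({"t1"}),        # killed only by t1
    "m2": frozenset({"t1", "t2"}),  # subsumed by m1: killing m1 kills m2
    "m3": frozenset(),              # equivalent mutant: never killed
}
print(subsuming(matrix))            # -> ['m1']
```

Any test set that kills `m1` necessarily kills `m2` as well, so analyzing `m1` alone preserves the full test requirement.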
Apart from improving test suite strength, mutation testing has proven useful in inferring software specifications. Software specifications describe the software’s intended behavior and can be used to distinguish correct from incorrect software behaviors. Specification inference techniques aim to infer assertions by generating and filtering candidate assertions through dynamic test executions and mutation testing. Due to the large number of mutants introduced during mutation testing, such techniques are also computationally expensive, hence establishing a need to select the mutants that fit best for assertion inference. We refer to such mutants as Assertion Inferring Mutants. In our analysis, we find that assertion inferring mutants differ significantly from subsuming mutants. Thus, we explored the applicability of deep learning to identify Assertion Inferring Mutants. Hence, as our second contribution, this dissertation proposes Seeker, a deep learning approach to statically select Assertion Inferring Mutants. Our evaluation demonstrates that Seeker enables an assertion inference capability comparable to full mutation analysis while significantly limiting the execution cost.
In addition to testing software in general, a few works in the literature attempt to employ mutation testing to tackle security-related issues, due to the fault-based nature of the technique. These works propose mutation operators that convert non-vulnerable code into vulnerable code by mimicking common security bugs. However, these pattern-based approaches have two major limitations. Firstly, the design of security-specific mutation operators is not trivial; it requires manual analysis and comprehension of the vulnerability classes. Secondly, these mutation operators can alter the program semantics in a manner that is not convincing for developers and is perceived as unrealistic, thereby hindering the usability of the method. On the other hand, with the release of powerful language models trained on large code corpora, e.g., CodeBERT, a new family of mutation testing tools has arisen with the promise of generating natural mutants. We study the extent to which mutants produced by language models can semantically mimic the behavior of vulnerabilities, i.e., Vulnerability-mimicking Mutants. Test cases that fail on these mutants will also tackle the mimicked vulnerabilities. In our analysis, we found that only a very small subset of mutants is vulnerability-mimicking; however, this set mimics more than half of the vulnerabilities in our dataset. Due to the absence of any defined features to identify vulnerability-mimicking mutants, as our third contribution, this dissertation introduces Mystique, a deep learning approach that automatically extracts features to identify vulnerability-mimicking mutants. Despite their scarcity, Mystique predicts vulnerability-mimicking mutants with high prediction performance, demonstrating that their features can be automatically learned by deep learning models to statically predict these mutants without investing any effort in defining features.
Since our vulnerability-mimicking mutants cannot mimic all the vulnerabilities, we perceive that these mutants are not a complete representation of all vulnerabilities, and there remains a need for actual vulnerability prediction approaches. Although many such approaches exist in the literature, their performance is limited by a few factors. Firstly, vulnerabilities are fewer in comparison to software bugs, limiting the information one can learn from, which affects prediction performance. Secondly, the existing approaches learn from both vulnerable and supposedly non-vulnerable components. This introduces unavoidable noise in the training data, i.e., components with no reported vulnerability are considered non-vulnerable during training, and hence results in existing approaches performing poorly. We employed deep learning to automatically capture features related to vulnerabilities and explored whether we can avoid learning from supposedly non-vulnerable components. Hence, as our final contribution, this dissertation proposes TROVON, a deep learning approach that learns only from components known to be vulnerable, thereby making no assumptions and bypassing the key problem faced by previous techniques. Our comparison of TROVON with existing techniques on security-critical open-source systems with historical vulnerabilities reported in the National Vulnerability Database (NVD) demonstrates that its prediction capability significantly outperforms the existing techniques.
Learning To Predict Vulnerabilities From Vulnerability-Fixes: A Machine Translation Approach
Vulnerability prediction refers to the problem of identifying system
components that are most likely to be vulnerable. Typically, this problem is
tackled by training binary classifiers on historical data. Unfortunately,
recent research has shown that such approaches underperform due to the
following two reasons: a) the imbalanced nature of the problem, and b) the
inherently noisy historical data, i.e., most vulnerabilities are discovered
much later than they are introduced. This misleads classifiers as they learn to
recognize actual vulnerable components as non-vulnerable. To tackle these
issues, we propose TROVON, a technique that learns from known vulnerable
components rather than from vulnerable and non-vulnerable components, as
typically performed. We do this by contrasting known vulnerable components with
their respective fixed versions. This way, TROVON manages to learn from the
things we know, i.e., vulnerabilities, hence reducing the effects of noisy and
unbalanced data. We evaluate TROVON by comparing it with existing techniques on
three security-critical open source systems, i.e., Linux Kernel, OpenSSL, and
Wireshark, with historical vulnerabilities that have been reported in the
National Vulnerability Database (NVD). Our evaluation demonstrates that the
prediction capability of TROVON significantly outperforms existing
vulnerability prediction techniques such as Software Metrics, Imports, Function
Calls, Text Mining, Devign, LSTM, and LSTM-RF with an improvement of 40.84% in
Matthews Correlation Coefficient (MCC) score under Clean Training Data
Settings, and an improvement of 35.52% under Realistic Training Data Settings.
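The core idea of contrasting vulnerable components with their fixes can be sketched as follows (an illustrative sketch only; the function names, the whitespace tokenizer, and the example snippet are assumptions for this sketch, not TROVON’s actual implementation or API):

```python
# Illustrative sketch (NOT TROVON's actual code): build translation-style
# training pairs from known vulnerable components and their fixed
# counterparts only, so no "supposedly non-vulnerable" component ever
# enters the training data.

def build_training_pairs(vulnerability_fixes):
    """vulnerability_fixes: iterable of (vulnerable_code, fixed_code)
    snippets mined from vulnerability-fixing commits (e.g. linked in NVD).
    Returns (source, target) token sequences for a seq2seq-style model."""
    pairs = []
    for vulnerable, fixed in vulnerability_fixes:
        source = vulnerable.split()  # stand-in for a real code tokenizer
        target = fixed.split()
        pairs.append((source, target))
    return pairs

# Hypothetical fix pair: an unbounded copy replaced by a bounded one.
fixes = [("strcpy ( dst , src ) ;", "strncpy ( dst , src , n ) ;")]
pairs = build_training_pairs(fixes)
print(pairs[0][0][0], "->", pairs[0][1][0])  # strcpy -> strncpy
```

Because every training pair is anchored on a confirmed vulnerability, the approach sidesteps the noisy assumption that unreported components are non-vulnerable.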
Syntactic vs. Semantic Similarity of Artificial and Real Faults in Mutation Testing Studies
Fault seeding is typically used in controlled studies to evaluate and compare
test techniques. Central to these techniques lies the hypothesis that
artificially seeded faults involve some form of realistic properties and thus
provide realistic experimental results. In an attempt to strengthen realism, a
recent line of research uses advanced machine learning techniques, such as deep
learning and Natural Language Processing (NLP), to seed faults that look like
(syntactically) real ones, implying that fault realism is related to syntactic
similarity. This raises the question of whether seeding syntactically similar
faults indeed results in semantically similar faults and more generally whether
syntactically dissimilar faults are far away (semantically) from the real ones.
We answer this question by employing 4 fault-seeding techniques (PiTest - a
popular mutation testing tool, IBIR - a tool with manually crafted fault
patterns, DeepMutation - a learning-based fault seeding framework and CodeBERT -
a novel mutation testing tool that uses code embeddings) and demonstrate that
syntactic similarity does not reflect semantic similarity. We also show that
60%, 47%, 43%, and 7% of the real faults of Defects4J V2 are semantically
resembled by CodeBERT, PiTest, IBIR, and DeepMutation faults. We then perform
an objective comparison between the techniques and find that CodeBERT and
PiTest have similar fault detection capabilities that subsume IBIR and
DeepMutation, and that IBIR is the most cost-effective technique. Moreover, the
overall fault detection of PiTest, CodeBERT, IBIR, and DeepMutation was, on
average, 54%, 53%, 37%, and 7%, respectively.
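One common way to compare faults semantically is by the tests they fail rather than by their source-code diff; the Ochiai coefficient below is an illustrative choice of similarity measure over failing-test sets, not necessarily the exact measure used in this study:

```python
# Illustrative sketch: semantic similarity of two faults as the overlap
# of the test sets each fault fails (Ochiai coefficient). The test names
# and values are invented for this example.
from math import sqrt

def ochiai(failing_a, failing_b):
    """Similarity in [0, 1] of two faults, given the set of tests
    each one fails; 1.0 means identical failing behavior."""
    if not failing_a or not failing_b:
        return 0.0
    shared = len(failing_a & failing_b)
    return shared / sqrt(len(failing_a) * len(failing_b))

real_fault = {"t1", "t2", "t3"}
seeded_fault = {"t2", "t3"}  # syntactically different, behaviorally close
print(round(ochiai(real_fault, seeded_fault), 2))  # -> 0.82
```

Under such a measure, a seeded fault can be syntactically far from a real one yet score high semantically, which is the distinction the study probes.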
On Comparing Mutation Testing Tools through Learning-based Mutant Selection
Recently, many mutation testing tools have been proposed that rely on bug-fix patterns and natural language models trained on large code corpora. As these tools operate fundamentally differently from the grammar-based traditional approaches, a question arises of how they compare in terms of 1) fault detection and 2) cost-effectiveness. Simultaneously, mutation testing research proposes mutant selection approaches based on machine learning to mitigate its application cost. This raises another question: how do the existing mutation testing tools compare when guided by mutant selection approaches? To answer these questions, we compare four existing tools – μBERT (uses a pre-trained language model for fault seeding), IBIR (relies on inverted fix-patterns), DeepMutation (generates mutants by employing Neural Machine Translation) and PIT (applies standard grammar-based rules) – in terms of fault detection capability and cost-effectiveness, in conjunction with standard and deep learning-based mutant selection strategies. Our results show that IBIR has the highest fault detection capability among the four tools; however, it is not the most cost-effective when considering different selection strategies. On the other hand, μBERT, despite having a relatively lower fault detection capability, is the most cost-effective among the four tools. Our results also indicate that comparing mutation testing tools when using deep learning-based mutant selection strategies can lead to different conclusions than with standard mutant selection. For instance, our results demonstrate that combining μBERT with deep learning-based mutant selection yields 12% higher fault detection than the considered tools.
Cerebro: Static Subsuming Mutant Selection
Mutation testing research has indicated that a major part of its application
cost is due to the large number of low-utility mutants that it introduces.
Although previous research has identified this issue, no previous study has
proposed any effective solution to the problem. Thus, it remains unclear how to
mutate and test a given piece of code in a best-effort way, i.e., achieving a
good trade-off between invested effort and test effectiveness. To achieve this,
we propose Cerebro, a machine learning approach that statically selects
subsuming mutants, i.e., the set of mutants that resides on the top of the
subsumption hierarchy, based on the mutants' surrounding code context. We
evaluate Cerebro using 48 and 10 programs written in C and Java, respectively,
and demonstrate that it preserves the mutation testing benefits while limiting
application cost, i.e., reduces all application cost factors such as equivalent
mutants, mutant executions, and the mutants requiring analysis. We demonstrate
that Cerebro has strong inter-project prediction ability, which is
significantly higher than two baseline methods, i.e., supervised learning on
features proposed by state-of-the-art, and random mutant selection. More
importantly, our results show that Cerebro's selected mutants lead to strong
tests that are capable of killing 2 times more subsuming mutants than those
killed by the baselines when selecting the same number of mutants. At the same
time, Cerebro reduces the cost-related factors, as it selects, on average, 68%
fewer equivalent mutants, while requiring 90% fewer test executions than the
baselines.